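This paper explores the task of Temporal Video Grounding (TVG): given an untrimmed video and a query sentence, the goal is to recognize and determine the temporal boundaries of the action instances in the video that the natural language query describes. Recent works address this task by encoding the query directly with large pre-trained language models (PLMs). However, the effect of the improved language representations is hard to isolate, because these works also propose improvements to the visual inputs. In addition, these PLMs substantially increase the computational cost of training TVG models. This paper therefore studies the effects of PLMs in the TVG task and assesses the applicability of parameter-efficient training alternatives from NLP that are based on adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that TVG models can benefit greatly from PLMs when these are fine-tuned for the task, and that adapters are an effective alternative to full fine-tuning even though they are not tailored to our task. Specifically, adapters help save computational cost, allowing PLM integration in larger TVG models and delivering results comparable to state-of-the-art models. Finally, by benchmarking different types of adapters in TVG, our results shed light on which kind of adapter works best in each studied case.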
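To make the parameter-saving argument concrete, the following is a minimal sketch of one generic adapter type, a bottleneck adapter with a residual connection. It is an illustration under stated assumptions, not the configuration evaluated in the paper: the class name `BottleneckAdapter`, the sizes 768 and 64, the omission of bias terms, and the zero initialisation of the up-projection are all hypothetical choices made for this example.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, xi) for xi in x]


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: h + W_up @ relu(W_down @ h).

    Biases are omitted for brevity (an assumption of this sketch).
    """

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection to a small bottleneck dimension.
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        # Up-projection starts at zero, so the adapter is initially
        # the identity and does not disturb the frozen PLM.
        self.W_up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, h):
        update = matvec(self.W_up, relu(matvec(self.W_down, h)))
        return [hi + ui for hi, ui in zip(h, update)]

    def num_parameters(self):
        return (sum(len(row) for row in self.W_down)
                + sum(len(row) for row in self.W_up))


d_model, d_bottleneck = 768, 64  # illustrative sizes only
adapter = BottleneckAdapter(d_model, d_bottleneck)
h = [0.1] * d_model
out = adapter(h)

# Trainable adapter weights versus one dense d_model x d_model layer.
dense_params = d_model * d_model
print(adapter.num_parameters(), dense_params)
```

In this setup only the adapter weights would be trained while the PLM weights stay frozen, which is where the reduction in training cost comes from.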
translated by 谷歌翻译
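Convolutional neural networks (CNNs) have several robustness problems. For example, a CNN's prediction can be changed by adding a small amount of noise to the input, and its performance degrades when the input distribution is shifted by transformations never seen during training (e.g., blur effects). Some approaches replace pixel values with binary embeddings to counter adversarial perturbations, and these have successfully improved robustness. In this work, we propose Pixel to Binary Embedding (P2BE) to improve the robustness of CNNs. P2BE is a learnable binary embedding method, in contrast to previous hand-coded binary embedding methods. P2BE outperforms other binary embedding methods in robustness against adversarial perturbations and visual corruptions that are not shown during training.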
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to exploit desired contextual information provided in audio and visual modalities thoroughly with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. It uniformly attends to the textural information on the unmasked regions and the reference frame. Then the semantic audio information is involved in enhancing the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.
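We present a personal mobility device for lower-body impaired users, built as a lightweight exoskeleton on wheels. At its core, a novel passive exoskeleton provides postural transitions that leverage natural body postures, supporting the trunk during sit-to-stand and stand-to-sit (STS) transitions with a single gas spring as the energy storage unit. We propose a direction-dependent coupling of the knee and hip joints through a double-pulley wire system, which transfers energy from the torso motion to balance the moment load at the knee joint actuator. In this way, the exoskeleton maximizes energy transfer and the naturalness of the user's movement. We introduce an embodied user interface for hands-free navigation through torso pressure sensing with minimal trunk rotations, averaging $19^{\circ} \pm 13^{\circ}$ across six unimpaired users. We evaluated the design for STS assistance with 11 unimpaired users, observing motion and muscle activity during the transitions. Results comparing assisted and unassisted STS transitions validated a significant reduction (up to $68\%$, $p < 0.01$) in the activity of the involved muscle groups. Moreover, we show that the transitions are feasible through natural torso leaning movements of $+12^{\circ} \pm 6.5^{\circ}$ and $-13.7^{\circ} \pm 6.1^{\circ}$ for standing and sitting, respectively. Passive postural transition assistance warrants further work on increasing its applicability and broadening the user population.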
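Nesterov's accelerated quasi-Newton ((L)NAQ) method has been shown to accelerate the conventional (L)BFGS quasi-Newton method by using Nesterov's accelerated gradient in several neural network (NN) applications. However, computing two gradients per iteration increases the computational cost. The Momentum-accelerated Quasi-Newton (MoQ) method showed that Nesterov's accelerated gradient can be approximated as a linear combination of past gradients. This abstract extends the MoQ approximation to limited-memory NAQ and evaluates its performance on a function approximation problem.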
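The idea of replacing the look-ahead gradient with a combination of stored gradients can be sketched as follows. This is a minimal illustration, assuming the momentum term is $v_k = w_k - w_{k-1}$ and that the approximation takes the form $(1+\mu)\nabla E(w_k) - \mu\nabla E(w_{k-1})$; the abstract does not state the coefficients, so both are assumptions of this sketch. The test function, the matrix `A`, and all names are hypothetical. On a quadratic the gradient is affine in $w$, so the approximation is exact there; on general functions it is only an approximation.

```python
# Quadratic test function E(w) = 0.5 * w^T A w - b^T w, so grad E(w) = A w - b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, -1.0]


def grad(w):
    return [sum(A[i][j] * w[j] for j in range(2)) - b[i] for i in range(2)]


def nesterov_gradient(w, w_prev, mu):
    """Exact gradient at the look-ahead point w + mu * (w - w_prev).

    This costs one extra gradient evaluation per iteration.
    """
    lookahead = [wi + mu * (wi - pi) for wi, pi in zip(w, w_prev)]
    return grad(lookahead)


def moq_gradient(g, g_prev, mu):
    """Approximation from stored gradients: (1 + mu) * g_k - mu * g_{k-1}."""
    return [(1 + mu) * gi - mu * pi for gi, pi in zip(g, g_prev)]


w_prev, w, mu = [0.5, -0.2], [0.3, 0.1], 0.9
exact = nesterov_gradient(w, w_prev, mu)
approx = moq_gradient(grad(w), grad(w_prev), mu)

# Largest componentwise difference between the two gradients.
max_gap = max(abs(e - a) for e, a in zip(exact, approx))
print(max_gap)
```

Because `moq_gradient` reuses the gradient from the previous iteration, each step needs only one new gradient evaluation instead of two.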
Spatially varying spectral modulation can be implemented using a liquid crystal spatial light modulator (SLM) since it provides an array of liquid crystal cells, each of which can be purposed to act as a programmable spectral filter array. However, such an optical setup suffers from strong optical aberrations due to the unintended phase modulation, precluding spectral modulation at high spatial resolutions. In this work, we propose a novel computational approach for the practical implementation of phase SLMs for implementing spatially varying spectral filters. We provide a careful and systematic analysis of the aberrations arising out of phase SLMs for the purposes of spatially varying spectral modulation. The analysis naturally leads us to a set of "good patterns" that minimize the optical aberrations. We then train a deep network that overcomes any residual aberrations, thereby achieving ideal spectral modulation at high spatial resolution. We show a number of unique operating points with our prototype including dynamic spectral filtering, material classification, and single- and multi-image hyperspectral imaging.